4  Select DEGs

4.1 Performing differential analysis using DESeq2

DESeq2(Differential Expression using DESeq2)

Principle: DESeq2 is based on a negative binomial distribution model, taking into account the discreteness and variability of gene expression data, as well as the impact of library size differences on differential analysis. DESeq2 reduces technical variation between samples through normalization transformations and scaling factors, then estimates the dispersion of gene expression. It employs the negative binomial distribution model to identify differentially expressed genes and corrects for multiple testing issues. DESeq2 is suitable for small sample RNA sequencing data and performs particularly well with a small number of samples, effectively handling inter-sample variability, technical noise, and batch effects.

Input Data Type: DESeq2 is designed for RNA-seq data and requires count data for each gene. It typically uses raw count matrices without the need for pre-normalization or scaling. DESeq2 requires the input data to be a matrix composed of integers. In addition to importing counts, DESeq2 officially provides a method to directly import txi objects generated by tximport if salmon was used upstream.

DEG_deseq2 <- DESeq2_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/DEG_deseq2.Rdata',
    logFC = 2.5,
    p_value = 0.01
)
Group Table:
group
normal  tumor 
   811    470 
converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 4831 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
Change Table:

  down stable     up 
  3824  48356   4199 
head(DEG_deseq2, 5)
          baseMean log2FoldChange      lfcSE      stat pvalue padj change
A1BG      134.6084      -2.549682 0.05677219 -44.91075      0    0   down
A2M     30208.9912       3.251168 0.08394645  38.72907      0    0     up
AADACL2   801.4538      -8.229956 0.18969649 -43.38486      0    0   down
AARS2    1153.5493       1.624753 0.03283522  49.48202      0    0 stable
AARSD1    567.8672      -2.082289 0.02275703 -91.50088      0    0 stable

4.2 Performing differential analysis using edgeR

edgeR(Empirical Analysis of Digital Gene Expression Data in R)

Principle: Similar to DESeq2, edgeR also accounts for the dispersion in data. It utilizes a negative binomial distribution model and different normalization methods, such as TMM (Trimmed Mean of M values), to address technical variations between samples. edgeR employs a hypothesis-testing framework to identify differentially expressed genes and uses multiple testing correction akin to the Benjamini-Hochberg method. edgeR performs better with a larger number of samples and is suitable for medium-scale RNA sequencing data, offering high sensitivity and accuracy.

Input Data Type: edgeR is also applicable to RNA-seq data, requiring expression counts for each gene. It uses the raw count matrix and does not require normalization or standardization. When using edgeR, it is important to select the appropriate analysis algorithm; the official recommendation for bulk RNA-seq is the quasi-likelihood (QL) F-test tests, and for scRNA-seq or data without replicates, the likelihood ratio test is suggested.

DEG_edgeR <- edgeR_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/DEG_edgeR.Rdata',
    logFC_threshold = 2.5,
    p_value_threshold = 0.01
)
Group Table:
group
normal  tumor 
   811    470 
Change Table:

  down stable     up 
  2984  20294   3842 
head(DEG_edgeR, 5)
                  logFC   logCPM        LR PValue FDR change
CH507-513H4.4 -16.46830 8.739277  9184.402      0   0   down
CH507-513H4.3 -16.18797 8.739279  8829.571      0   0   down
XBP1          -15.00117 5.932264 40023.604      0   0   down
CH507-513H4.6 -14.81025 8.739297  7042.096      0   0   down
KREMEN1       -14.45332 5.384819 22028.487      0   0   down

4.3 Performing differential analysis using limma

limma(Linear Models for Microarray Data)

Principle: Initially developed for gene chip data, limma was later adapted for RNA sequencing data. It is based on linear models and employs weighted least squares for estimating differences in gene expression, and a Bayesian approach for correcting multiple testing issues. Limma excels in handling large-scale data and is suitable for high-throughput data analysis, such as chip and large-scale RNA sequencing data, effectively controlling the false positive rate.

Input Data Type: limma is applicable to various types of high-throughput data, including chip data and RNA-seq. It requires the expression values for each gene, which can be either raw counts or already normalized values. There are two methods for differential analysis in limma: limma-trend and voom. The difference between these two methods is minimal when the sequencing depth of samples is similar, but voom has advantages when there is a significant difference in sequencing depth, hence voom is generally the preferred method for differential analysis.

DEG_limma_voom <- limma_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/DEG_limma_voom.Rdata',
    logFC_threshold = 2.5,
    p_value_threshold = 0.01
)
Group Table:
group
normal  tumor 
   811    470 
Change Table:

  down stable     up 
  1874  16990   1611 
head(DEG_limma_voom, 5)
             logFC    AveExpr         t P.Value adj.P.Val        B change
TOMM6   -11.347906  1.6560534 -319.5710       0         0 2790.496   down
EEF1G   -11.484561  6.6305909 -218.5213       0         0 2324.291   down
XBP1    -12.063108  2.1101301 -192.0646       0         0 2162.558   down
ARL6IP4  -5.415435  4.6609716 -189.5647       0         0 2149.674   down
CMC4     -7.768105 -0.5940984 -187.1232       0         0 2129.708   down

4.4 Wilcoxon Rank-Sum Test

Experiments indicate that the Wilcoxon rank-sum test, as a non-parametric test, is not recommended when the sample size per condition is less than eight. However, once the sample size per condition exceeds eight, the Wilcoxon rank-sum test performs comparably or better than the three parametric testing methods (DESeq2, edgeR, and limma-voom) or other non-parametric testing methods.

The three parametric testing methods DESeq2, edgeR, and limma were all initially developed to address the problem of analysis with small sample sizes. In population-level studies, which typically have larger sample sizes (at least several dozen), parametric assumptions are often unnecessary. Additionally, larger sample sizes are more likely to include several outliers that, when they violate assumptions, can lead to unstable p-values and consequently invalidate FDR.

Unlike DESeq2, edgeR, and limma, the Wilcoxon rank-sum test, as a non-regression method, cannot adjust for various potential confounding factors (such as differences in sequencing depth). To use the Wilcoxon rank-sum test, it is generally recommended that RNA-seq samples first be normalized to eliminate batch effects before computation.

Note: The method selected for function implementation in this section is: edgeR TMM + wilcox.test analysis

outRst <- Wilcoxon_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/Wilcoxon_rank_sum_testoutRst.Rdata',
    logFC_threshold = 2.5,
    fdr_threshold = 0.01
)
Group Table:
group
normal  tumor 
   811    470 
Change Table:

  down stable     up 
  2252  16954   1269 
head(outRst, 5)
         log2foldChange       pValues           FDR change
A1BG        -2.51276182 1.483409e-169 2.270014e-168   down
A1BG-AS1    -0.03966404  2.557352e-09  3.168066e-09 stable
A2M          3.39825855 2.364713e-144 1.946823e-143     up
A2M-AS1      1.70962233  2.001256e-47  3.931655e-47 stable
A2ML1       -5.14816236 3.597150e-115 1.712431e-114   down

4.5 four degs venn

all_degs_venn <- list(
    DESeq2 = deg_filter(DEG_deseq2), 
    edgeR = deg_filter(DEG_edgeR), 
    limma = deg_filter(DEG_limma_voom), 
    Wilcoxon_test = deg_filter(outRst)
)
four_degs_venn <- four_degs_venn(all_degs_venn)
four_degs_venn

4.6 Reference

  • DESeq2:

Differential gene expression analysis based on the negative binomial distribution

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

  • edgeR:

edgeR: differential analysis of sequence read count data

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

  • limma:

Linear Models for Microarray and RNA-Seq Data User’s Guide

Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7),e47-e47.

  • Wilcoxon Rank-Sum Test:

Li, Y., Ge, X., Peng, F., Li, W., & Li, J. J. (2022). Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology, 23(1), 79.